Abstract: Protein-protein interactions (PPIs) play a key role in many cellular processes. 
Unfortunately, the experimental methods currently used to identify PPIs are both 
time-consuming and expensive. These obstacles could be overcome by developing 
computational approaches to predict PPIs. Here, we report two amino acid feature 
extraction methods for representing protein sequences: (i) distance frequency with PCA-based 
dimension reduction (DFPCA) and (ii) amino acid index distribution (AAID). To 
obtain robust and reliable PPI predictions, a pairwise kernel function and 
support vector machines (SVM) were employed so that the result does not depend on the order in 
which the feature vectors of the two proteins are concatenated. The highest prediction accuracies of AAID and 
DFPCA were 94% and 93.96%, respectively, in the 10-fold cross-validation (10 CV) test, and the results of the 
pairwise radial basis kernel function are considerably better than those of the radial 
basis kernel function. Overall, the PPI prediction tool, termed PPI-PKSVM, which is freely 
available at http://159.226.118.31/PPI/index.html, promises to become useful in such areas 
as bio-analysis and drug development. 

Keywords: amino acid distance frequency; amino acid index distribution; protein-protein 
interaction; pairwise kernel function; support vector machine 
1. Introduction 

Protein-protein interactions (PPIs) play an important role in such biological processes as host 
immune response, the regulation of enzymes, signal transduction and mediating cell adhesion. 
Understanding PPIs will bring more insight to disease etiology at the molecular level and potentially 
simplify the discovery of novel drug targets [1]. Information about protein-protein interactions has 
also been used to address many biologically important problems [2-5], such as prediction of protein 
function [2], regulatory pathways [3], signal propagation during colorectal cancer progression [4], and 
identification of colorectal cancer related genes [5]. Experimental methods of identifying PPIs can be 
roughly categorized into low- and high-throughput methods [6]. However, PPI data obtained from 
low-throughput methods only cover a small fraction of the complete PPI network, and high-throughput 
methods often produce a high frequency of false PPI information [7]. Moreover, experimental methods 
are expensive, time-consuming and labor-intensive. The development of reliable computational methods 
to facilitate the identification of PPIs could overcome these obstacles. 

Thus far, a number of computational approaches have been developed for the large-scale prediction 
of PPIs based on protein sequence, structure and evolutionary relationship in complete genomes. These 
methods can be roughly categorized into those that are genomic-based [8,9], structure-based [10], and 
sequence-based [11-26]. Genomic- and structure-based methods cannot be implemented if prior 
information about the proteins is not available. Sequence-based methods are more universal, but they 
concatenate the two feature vectors of proteins $P_a$ and $P_b$ to represent the protein pair $P_a$-$P_b$, and the 
concatenation order of the two feature vectors affects the prediction results. For example, if we use 
feature vectors $x_a$ and $x_b$ to represent proteins $P_a$ and $P_b$, respectively, then the $P_a$-$P_b$ protein pair can be 
expressed as $x_{ab} = x_a \oplus x_b$ or $x_{ba} = x_b \oplus x_a$. In general, however, $x_a \oplus x_b$ is not equal to $x_b \oplus x_a$. 
Furthermore, PPIs are symmetric; that is, the interaction of protein $P_a$ with protein $P_b$ 
equals the interaction of protein $P_b$ with protein $P_a$. Under these circumstances, concatenating the two 
feature vectors of proteins $P_a$ and $P_b$ to represent the protein pair $P_a$-$P_b$ and then using the traditional 
kernel $K(x_1, x_2)$ to predict PPIs would not be workable. 

Therefore, in this paper, we introduced two kinds of feature extraction approaches, amino acid 
distance frequency with PCA reducing the dimension (DFPCA) and amino acid index distribution 
(AAID) to represent the protein sequences, followed by the use of pairwise kernel function and SVM to 
predict PPI. 

2. Results and Discussion 

LIBSVM [27], downloaded from http://www.csie.ntu.edu.tw/~cjlin, is a library for support vector 
machines (SVMs), and it was used to build the classifier in this paper. The kernel routine of the 
software was modified to implement the pairwise kernel functions, which were built on the RBF genomic kernel 
function $K(x_1, x_2)$ in all experiments. 
2.1. The Results of DFPCA and AAID with the $K_{II}$ Pairwise Kernel Function SVM 

In statistical prediction, the following three cross-validation methods are often used to examine a 
predictor for its effectiveness in practical application: independent dataset test, K-fold crossover or 
subsampling test, and jackknife test [28]. Of the three test methods, the jackknife test is 
deemed the least arbitrary because it always yields a unique result for a given benchmark dataset, as 
demonstrated by Equations (28)-(30) in [29]. Accordingly, the jackknife test has been increasingly and 
widely used by investigators to examine the quality of various predictors (see, e.g., [30-41]). However, 
to reduce the computational time, we adopted the 10-fold cross-validation (10 CV) test in this study, as 
done by many investigators with SVM as the prediction engine. 
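
As an illustration of the 10 CV protocol, the following minimal sketch splits a dataset into ten disjoint folds and uses each fold once as the test set. The function name `k_fold_indices`, the fixed seed, and the round-robin assignment of samples to folds are assumptions made for this example; the sample count 5020 corresponds to the 2510 + 2510 subchain pairs described in Section 3.1.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute the shuffled samples over k folds as evenly as possible.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# 2510 interacting + 2510 noninteracting subchain pairs.
splits = list(k_fold_indices(5020, k=10))
```

Each sample appears in exactly one test fold, so the per-fold scores can be averaged and reported as mean ± standard deviation, as in the tables below.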

The four feature vector sets, Hf, Vf, Pf, and Zf, extracted with DFPCA and the five feature vector 
sets, LEWP710101, QIAN880138, NADH010104, NAGK730103 and AURR980116, extracted with 
AAID were employed as the input feature vectors for the $K_{II}$ pairwise radial basis kernel function (PRBF) 
SVM. The results of DFPCA and AAID are summarized in Table 1. 



Table 1. Results of DFPCA and AAID with PRBF SVM in the 10 CV test. 

Feature Set   S_n (%)        PPV (%)        ACC (%)        MCC
Hf            95.94 ± 1.92   91.98 ± 2.88   93.78 ± 1.44   0.8765
Vf            95.66 ± 2.75   92.52 ± 2.40   93.96 ± 1.86   0.8798
Pf            95.78 ± 2.23   92.07 ± 1.69   93.76 ± 1.93   0.8760
Zf            96.06 ± 1.24   91.71 ± 3.13   93.69 ± 1.86   0.8747
LEWP710101    95.86 ± 2.23   92.08 ± 4.32   93.80 ± 2.42   0.8768
QIAN880138    96.06 ± 2.83   92.27 ± 1.50   94.00 ± 1.22   0.8808
NADH010104    95.82 ± 2.98   92.04 ± 2.51   93.76 ± 1.66   0.8760
NAGK730103    96.06 ± 2.83   92.09 ± 4.02   93.90 ± 3.31   0.8789
AURR980116    95.94 ± 2.07   92.33 ± 1.42   93.98 ± 1.24   0.8804

From Table 1, we can see that the performances of the two feature extraction approaches, i.e., 
amino acid distance frequency with PCA (DFPCA) and amino acid index distribution (AAID), are 
nearly equal when using the $K_{II}$ pairwise kernel SVM. The total prediction accuracies are 93.69%~94%. 
As previously noted, we used just five amino acid indices, LEWP710101, QIAN880138, 
NADH010104, NAGK730103 and AURR980116, to produce the feature vector sets. When we tested 
the performance of AAID with the remaining 480 amino acid indices from AAindex, we found that 
the choice of amino acid index does affect the predictive results: the total prediction accuracies of those 
indices were 79.4%~94%. With our original five indices, the performance of 
AAID compared favorably with the results obtained from the rest of AAindex. To account for the better performance of 
our five indices, we point to the physicochemical and biochemical properties of amino acids. By 
single-linkage clustering, one of the agglomerative hierarchical clustering methods, Tomii and Kanehisa [42] 
divided the minimum spanning tree of these amino acid indices into six regions: α and turn propensities, 
β propensity, amino acid composition, hydrophobicity, physicochemical properties, and other properties. 
The indices LEWP710101, QIAN880138, NAGK730103 and AURR980116 fall in the 
region of α and turn propensities, while NADH010104 falls in the hydrophobicity region, 
indicating that α and turn propensities and hydrophobicity contain more 
discriminative information for predicting PPIs. 

2.2. The Comparison of Pairwise Kernel Function with Traditional Kernel Function 

In order to evaluate the performance of the pairwise kernel function, we compared the results of the pairwise 
radial basis kernel function (PRBF) and the radial basis function kernel (RBF) with the same feature vector 
sets. For RBF, we concatenated the two feature vectors of protein $P_a$ and protein $P_b$ to represent the 
protein pair $P_a$-$P_b$; that is, the feature vector $x_{ab} = x_a \oplus x_b$ was used as the input feature vector of RBF. The 
results of RBF and PRBF with DFPCA in the 10 CV test are listed in Table 2. 



Table 2. Results of RBF and PRBF with DFPCA in the 10 CV test. 

Feature Set   Kernel Function   S_n (%)        PPV (%)        ACC (%)
Hf            RBF               89.96 ± 0.52   89.65 ± 2.17   89.88 ± 1.05
Hf            PRBF              95.94 ± 1.92   91.98 ± 2.88   93.78 ± 1.44
Vf            RBF               90.20 ± 1.31   89.33 ± 2.60   89.72 ± 1.72
Vf            PRBF              95.66 ± 2.75   92.52 ± 2.40   93.96 ± 1.86
Pf            RBF               89.32 ± 0.86   89.26 ± 2.91   89.28 ± 1.44
Pf            PRBF              95.78 ± 2.23   92.07 ± 1.69   93.76 ± 1.93
Zf            RBF               90.84 ± 1.85   88.79 ± 2.50   89.64 ± 1.18
Zf            PRBF              96.06 ± 1.24   91.71 ± 3.13   93.69 ± 1.86

Table 2 shows that the performance of PRBF is superior to that of RBF for predicting PPIs. The total 
prediction accuracies of PRBF are 3.9%~4.48% higher than those of RBF. 

2.3. The Comparison of DF and DFPCA Feature Extraction Approaches 

For the feature extraction approach of distance frequency of amino acids grouped with their 
physicochemical properties, we compared the results of DF and DFPCA with PRBF SVM to test the 
validity of adopting PCA. PCA was set so that the reduced feature matrix retains 99.9% of the information in the original 
feature matrix. The results of DF and DFPCA with PRBF SVM in the 10 CV test are listed in Table 3. 



Table 3. Results of DF and DFPCA with PRBF SVM in the 10 CV test. 

Feature Set   Feature Extraction Approach   S_n (%)        PPV (%)        ACC (%)        MCC
Hf            DF                            97.37 ± 2.55   66.67 ± 27.8   74.34 ± 24.3   0.5485
Hf            DFPCA                         95.94 ± 1.92   91.98 ± 2.88   93.78 ± 1.44   0.8765
Vf            DF                            97.21 ± 2.39   71.40 ± 23.0   78.17 ± 27.1   0.6093
Vf            DFPCA                         95.66 ± 2.75   92.52 ± 2.40   93.96 ± 1.86   0.8798
Pf            DF                            97.13 ± 4.70   69.48 ± 25.5   77.23 ± 27.2   0.5937
Pf            DFPCA                         95.78 ± 2.23   92.07 ± 1.69   93.76 ± 1.93   0.8760
Zf            DF                            97.65 ± 4.82   62.29 ± 29.5   69.26 ± 23.6   0.4680
Zf            DFPCA                         96.06 ± 1.24   91.71 ± 3.13   93.69 ± 1.86   0.8747

From Table 3, we can see that the performance of DFPCA is superior to that of DF. The total 
prediction accuracies and MCC (see Equation (16) below) of DFPCA are 15.79%~24.43% and 
0.2705~0.4067 higher than those of DF, respectively. Although the sensitivities of DF are a little higher 
(1.43%~1.59%) than those of DFPCA for the Hf, Vf, Pf and Zf feature sets, the positive predictive 
values of DF are much lower (by 21%~29%) than those of DFPCA, which means that the DFPCA approach can 
largely reduce the false positives. These results show that the performance of DFPCA is superior to that 
of DF for predicting PPIs. It should be noted that feature vectors generated with either DF or DFPCA 
contain statistical information of amino acids in protein sequences, as well as information about amino 
acid position and physicochemical properties. 

2.4. The Performance of the Predictive System Influenced by Randomly Sampling the Noninteracting 
Protein Subchain Pairs 

To investigate the influence of randomly sampling the noninteracting protein subchain pairs, 
we randomly sampled 2510 noninteracting protein subchain pairs five times to construct five negative 
sets, and we used the DFPCA approach with hydrophobicity property to predict PPI in the 10CV test. 
The results, as shown in Table 4, indicate that random sampling of the noninteracting protein subchain 
pairs in order to construct negative sets has little influence on the performance of the PPI-PKSVM. 

Table 4. Effect of random sampling of the noninteracting protein subchain pairs on the 
performance of PPI-PKSVM with DFPCA and PRBF SVM in the 10 CV test. 

Sampling Time   S_n (%)        PPV (%)        ACC (%)        MCC
1               95.38 ± 3.35   91.20 ± 3.37   93.09 ± 3.45   0.8627
2               95.42 ± 1.39   91.52 ± 3.24   93.29 ± 1.65   0.8665
3               95.46 ± 3.03   91.21 ± 1.63   93.13 ± 2.29   0.8635
4               95.46 ± 3.03   91.49 ± 1.70   93.29 ± 2.13   0.8666
5               95.94 ± 1.92   91.98 ± 2.88   93.78 ± 1.44   0.8765

2.5. Comparison of Different Prediction Methods 

To demonstrate the prediction performance of our method, we compared it with other methods [25] 
on a nonredundant dataset constructed by Pan and Shen [25], in which no protein pair has sequence 
identity higher than 25%. The dataset contains 3899 positive links, i.e., interacting protein pairs, 
formed by 2502 proteins, and 4262 negative links, i.e., noninteracting protein pairs, 
formed by 661 proteins. Among the prediction results of the different methods shown in Table 5, 
the performance of PPI-PKSVM stands out as the best. Compared with Shen's LDA-RF, the 
accuracy (see Equation (15) below) of LEWP710101/QIAN880138 and of Hf-DFPCA is 1.9% and 2.0% higher, 
respectively, and the MCC is 0.038 and 0.039 higher, respectively. These results indicate that our method is a very 
promising computational strategy for predicting protein-protein interaction based on the protein sequences. 
Table 5. Performance comparison of different PPI methods using Shen's dataset ^a in the 
10 CV test. 

Method          S_n (%)        S_p (%)        ACC (%)        MCC
LEWP710101      97.3 ± 0.04    99.2 ± 0.04    98.3 ± 0.00    0.966 ± 0.0006
QIAN880138      97.3 ± 0.10    99.1 ± 0.10    98.3 ± 0.10    0.966 ± 0.002
NADH010104      97.2 ± 0.07    99.2 ± 0.04    98.3 ± 0.05    0.965 ± 0.0007
NAGK730103      97.2 ± 0.06    99.2 ± 0.04    98.2 ± 0.06    0.965 ± 0.0004
AURR980116      97.3 ± 0.04    99.1 ± 0.06    98.2 ± 0.06    0.965 ± 0.0006
Hf-DFPCA        97.6 ± 0.20    99.1 ± 0.10    98.4 ± 0.10    0.967 ± 0.002
Vf-DFPCA        97.5 ± 0.10    98.9 ± 1.00    98.3 ± 0.80    0.965 ± 0.007
Pf-DFPCA        96.9 ± 0.10    99.5 ± 0.60    98.2 ± 0.60    0.964 ± 0.004
Zf-DFPCA        97.9 ± 0.90    96.0 ± 0.20    96.9 ± 1.10    0.939 ± 0.002
LDA-RF ^b       94.2 ± 0.40    98.0 ± 0.30    96.4 ± 0.30    0.928 ± 0.006
LDA-RoF ^b      93.7 ± 0.50    97.6 ± 0.60    95.7 ± 0.40    0.918 ± 0.007
LDA-SVM ^b      89.7 ± 1.30    91.5 ± 1.10    90.7 ± 0.90    0.813 ± 0.018
AC-RF ^b        94.0 ± 0.60    96.6 ± 0.40    95.5 ± 0.30    0.914 ± 0.007
AC-RoF ^b       93.3 ± 0.70    97.1 ± 0.70    95.1 ± 0.60    0.910 ± 0.009
AC-SVM ^b       94.0 ± 0.60    84.9 ± 1.70    89.3 ± 0.80    0.792 ± 0.014
PseAAC-RF ^b    94.1 ± 0.90    96.9 ± 0.30    95.6 ± 0.40    0.912 ± 0.007
PseAAC-RoF ^b   93.6 ± 0.90    96.7 ± 0.40    95.3 ± 0.50    0.907 ± 0.009
PseAAC-SVM ^b   89.9 ± 0.70    92.0 ± 0.40    91.2 ± 0.4     0.821 ± 0.006

^a Shen's dataset contains two subdatasets, C and D, which are available at 
http://www.csbio.sjtu.edu.cn/bioinf/LR_PPI/Data.htm; ^b These results are taken from Table 4 of the literature [25]. 

3. Experimental Section 

3.1. Dataset 

To construct the PPI dataset, we first obtained the subchain pair name of PPIs from the PRISM 
(Protein Interactions by Structural Matching) server (http://prism.ccbb.ku.edu.tr/prism/), which was 
used to explore protein interfaces, and we downloaded the corresponding sequences of these protein 
subchain pairs from the Protein Data Bank (PDB) database (http://www.rcsb.org/pdb/). According to 
PRISM [43], a subchain pair is defined as an interacting subchain pair if the interface residues of two 
protein subchains exceed 10; otherwise, the subchain pair is defined as a noninteracting subchain pair. 
For example, suppose a protein complex has A, B, C and D subchains. If the interface residues of AB, 
AC, and BD subchain pairs total more than 10, while the interface residues of AD, BC and CD subchain 
pairs total less than 10, then the AB, AC, and BD subchain pairs are treated as interacting subchain pairs, 
while the AD, BC and CD subchain pairs are treated as noninteracting subchain pairs. All interacting 
protein subchain pairs were used in preparing the positive dataset, and all noninteracting subchain pairs 
were used in preparing the negative dataset. To reduce the redundancy and homology bias for 
methodology development, all protein subchain pairs were screened according to the following 
procedures [15]. (i) Protein subchain pairs containing a protein subchain with fewer than 50 amino acids 
were removed; (ii) For subchain pairs having >40% sequence identity, only one subchain pair was kept. 
The >40% criterion may be understood as follows. Suppose protein subchain pair A is formed with 
protein subchains A1 and A2 and protein subchain pair B is formed with protein subchains B1 and B2. If 
the sequence identity between A1 and B1 and between A2 and B2 is >40%, or the sequence identity 
between A1 and B2 and between A2 and B1 is >40%, then the two protein subchain 
pairs are defined as having >40% sequence identity. In our method, we retained only those subchain 
pairs having <40% sequence identity. After these screening procedures, the resultant positive set 
comprised 2510 interacting protein subchain pairs, while the resultant negative set contained many more 
noninteracting protein subchain pairs. To avoid unbalanced data between the positive and negative sets, 
we randomly sampled 2510 noninteracting protein subchain pairs to construct the negative set. 
Finally, a PPI dataset consisting of 2510 interacting subchain pairs and 2510 noninteracting protein subchain 
pairs was constructed. 
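
The pair-level screening rule can be sketched as follows. This is only an illustration under stated assumptions: `toy_identity` is a hypothetical placeholder that counts matching positions without any alignment (a real pipeline would use an alignment-based identity), the greedy keep-the-first-pair strategy in `screen_pairs` is an assumed reading of "only one subchain pair was kept", and the `min_length` parameter is lowered in the usage example only so that short toy sequences can be used.

```python
def toy_identity(seq_a, seq_b):
    """Placeholder identity: fraction of matching positions, with no alignment."""
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / max(len(seq_a), len(seq_b))

def pairs_redundant(pair_a, pair_b, identity, threshold=0.40):
    """True if two subchain pairs exceed the identity threshold in either orientation."""
    a1, a2 = pair_a
    b1, b2 = pair_b
    direct = identity(a1, b1) > threshold and identity(a2, b2) > threshold
    swapped = identity(a1, b2) > threshold and identity(a2, b1) > threshold
    return direct or swapped

def screen_pairs(pairs, identity=toy_identity, min_length=50, threshold=0.40):
    """Drop pairs with a short subchain, then greedily drop redundant pairs."""
    kept = []
    for pair in pairs:
        if min(len(pair[0]), len(pair[1])) < min_length:
            continue  # rule (i): a subchain is too short
        if any(pairs_redundant(pair, other, identity, threshold) for other in kept):
            continue  # rule (ii): redundant with a pair already kept
        kept.append(pair)
    return kept

# Toy data: the second pair is the first pair with its subchains swapped.
pairs = [("AAAAA", "CCCCC"), ("CCCCC", "AAAAA"), ("GGGGG", "TTTTT"), ("AA", "CCCCC")]
kept = screen_pairs(pairs, min_length=5)
```

Checking both orientations matters because a subchain pair has no intrinsic order; without the swapped test, the second toy pair would be kept as if it were new.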

3.2. Distance Frequency of Amino Acids Grouped with Their Physicochemical Properties 

The frequency of the distance between two successive amino acids, or distance frequency, was used 
to predict subcellular location by Matsuda et al. [44] and can be described as follows. For a protein 
sequence P, the distance set $d_A$ between successive occurrences of a letter (e.g., A) in protein sequence P 
can be represented as: 

$d_A = \{d_1, d_2, \ldots, d_i, \ldots, d_{n_A-1}\}, \quad i = 1, \ldots, n_A - 1$ (1)

where $n_A$ is the number of letters A appearing in protein sequence P, $d_i$ is the distance from the $i$th letter A to 
the $(i+1)$th letter A, and $d_i$ is calculated in a left-to-right fashion. The distance frequency vector for letter 
A can be defined by the following equation: 

$Df_A = [N_1, N_2, \ldots, N_j, \ldots, N_m]$ (2)

where $N_j$ represents the number of times that the $j$th distance unit appears in the set $d_A$. For example, 
considering the protein sequence AACDAMMADA, the distance sets of letters A, C, D and M are 
respectively 

$d_A = \{1, 3, 3, 2\}$, $d_C = \{0\}$, $d_D = \{5\}$, $d_M = \{1\}$

As a result, the corresponding distance frequency vectors are respectively 
$Df_A = [1, 1, 2, 0, 0]$, $Df_C = [0, 0, 0, 0, 0]$, $Df_D = [0, 0, 0, 0, 1]$, $Df_M = [1, 0, 0, 0, 0]$. The distance frequency 
vectors of the other 16 amino acids are zero vectors, $[0, 0, 0, 0, 0]$. Thus, we can use the feature vector $x$ to 
encode the protein sequence P: 

$x = [Df_A, Df_C, Df_D, \ldots, Df_Y]$
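
The worked example above can be reproduced with a short sketch. The function names and the choice of the vector length m as the largest distance observed in the sequence are assumptions made for illustration; a letter that occurs only once simply yields an empty distance list here, which gives the same zero vector as the $\{0\}$ convention used in the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def distance_set(sequence, letter):
    """Distances between successive occurrences of `letter`, left to right."""
    positions = [i for i, ch in enumerate(sequence) if ch == letter]
    return [b - a for a, b in zip(positions, positions[1:])]

def distance_frequency(distances, m):
    """Count how often each distance 1..m occurs."""
    vector = [0] * m
    for d in distances:
        vector[d - 1] += 1
    return vector

def encode(sequence, alphabet=AMINO_ACIDS):
    """Map every letter of the alphabet to its distance frequency vector."""
    sets = {a: distance_set(sequence, a) for a in alphabet}
    # Assumed convention: m is the largest distance seen in this sequence.
    m = max((d for ds in sets.values() for d in ds), default=0)
    return {a: distance_frequency(sets[a], m) for a in alphabet}

df = encode("AACDAMMADA")
print(df["A"], df["C"], df["D"], df["M"])
```

Concatenating the 20 per-letter vectors in alphabet order gives the feature vector $x$ for the sequence.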

In this work, we used the concept of distance frequency [44] and borrowed Dubchak's idea of 
representing the amino acid sequence with four physicochemical properties [45] to encode the protein 
subchain sequence. First, according to the amino acid values given by such physicochemical properties as 
hydrophobicity [46], normalized van der Waals volume [47], polarity [48] and polarizability [49], the 
20 natural amino acids can be divided into three groups [45], as listed in Table 6. For hydrophobicity, 
normalized van der Waals volume, polarity and polarizability, the amino acids in Group 1, Group 2 and 
Group 3 were denoted $H_1$, $H_2$, $H_3$; $V_1$, $V_2$, $V_3$; $P_1$, $P_2$, $P_3$; and $Z_1$, $Z_2$, $Z_3$, respectively. Second, each 
protein subchain sequence was translated into the appropriate three-symbol sequence, depending on 
the particular physicochemical property, be it $H_{1-3}$, $V_{1-3}$, $P_{1-3}$, or $Z_{1-3}$. For example, suppose that the 
original protein sequence is MKEKEFQSKP. Then, using the symbols for hydrophobicity, 
this sequence is translated into $H_3H_1H_1H_1H_1H_3H_1H_2H_1H_2$, and the same procedure applies to 
$V_{1-3}$, $P_{1-3}$, or $Z_{1-3}$. Third, the distance frequency of every symbol in the translated sequence was 
computed. In the above example, the $H_1$, $H_2$ and $H_3$ distance frequencies would be computed for 
the sequence $H_3H_1H_1H_1H_1H_3H_1H_2H_1H_2$. Finally, every protein subchain sequence can be encoded by 
the following feature vectors: 

$x_H = [x_{H_1}, x_{H_2}, x_{H_3}]$, $x_V = [x_{V_1}, x_{V_2}, x_{V_3}]$, $x_P = [x_{P_1}, x_{P_2}, x_{P_3}]$, $x_Z = [x_{Z_1}, x_{Z_2}, x_{Z_3}]$ (3)



Table 6. Amino acid groups classified according to their physicochemical value. 

Physicochemical property   Group 1                 Group 2                 Group 3
Hydrophobicity             H1: R,K,E,D,Q,N         H2: G,A,S,T,P,H,Y       H3: C,V,L,I,M,F,W
van der Waals volume       V1: G,A,S,C,T,P,D       V2: N,V,E,Q,I,L         V3: M,H,K,F,R,Y,W
Polarity                   P1: L,I,F,W,C,M,V,Y     P2: P,A,T,G,S           P3: H,Q,R,K,N,E,D
Polarizability             Z1: G,A,S,D,T           Z2: C,P,N,V,E,Q,I,L     Z3: K,M,H,F,R,Y,W

For convenience, the feature sets based on hydrophobicity, normalized van der Waals volume, polarity, 
and polarizability are denoted Hf, Vf, Pf and Zf, respectively. In general, the dimensions of the two 
feature vectors generated separately from two protein subchains are unequal. To solve this issue, we 
enlarge the feature vector dimension of one protein subchain so that it equals the feature vector dimension 
of the other subchain. For example, consider the following protein subchain pair $P_a$-$P_b$. 

Subchain $P_a$ amino acid sequence: MKEKEFQSKP 
Subchain $P_b$ amino acid sequence: QNSLALHKVIMVGSG 

If we adopt the property of hydrophobicity, then the $P_a$ and $P_b$ amino acid sequences are translated 
into the following symbol sequences, respectively. 

Subchain $P_a$: $H_3H_1H_1H_1H_1H_3H_1H_2H_1H_2$ 
Subchain $P_b$: $H_1H_1H_2H_3H_2H_3H_2H_1H_3H_3H_3H_3H_2H_2H_2$ 

Then, the distance sets of subchains $P_a$ and $P_b$ are: 

$d^a_{H_1} = \{1,1,1,2,2\}$, $d^a_{H_2} = \{2\}$, $d^a_{H_3} = \{5\}$, $d^b_{H_1} = \{1,6\}$, $d^b_{H_2} = \{2,2,6,1,1\}$, $d^b_{H_3} = \{2,3,1,1,1\}$, 

and the distance frequency vectors of subchains $P_a$ and $P_b$ are as follows: 

$x_a = [x^a_{H_1}, x^a_{H_2}, x^a_{H_3}]$, $x_b = [x^b_{H_1}, x^b_{H_2}, x^b_{H_3}]$ 

where 

$x^a_{H_1} = [3,2,0,0,0,0]$, $x^a_{H_2} = [0,1,0,0,0,0]$, $x^a_{H_3} = [0,0,0,0,1,0]$, 
$x^b_{H_1} = [1,0,0,0,0,1]$, $x^b_{H_2} = [2,2,0,0,0,1]$, $x^b_{H_3} = [3,1,1,0,0,0]$ 

Hereinafter we will use "DF" to represent the distance frequency method by grouping amino acids 
with their physicochemical properties. 
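
The grouped DF encoding of a subchain pair can be sketched as below, using the hydrophobicity groups from Table 6 and the example pair MKEKEFQSKP / QNSLALHKVIMVGSG. The helper names are hypothetical, and equalizing the two dimensions by sizing both vectors to the largest distance found in either subchain (that is, zero-padding the shorter one) is an assumption that is consistent with the six-dimensional vectors in the example.

```python
HYDROPHOBICITY_GROUPS = {"H1": "RKEDQN", "H2": "GASTPHY", "H3": "CVLIMFW"}

def translate(sequence, groups):
    """Replace each residue by the symbol of the group it belongs to."""
    lookup = {aa: g for g, members in groups.items() for aa in members}
    return [lookup[aa] for aa in sequence]

def distances(symbols, symbol):
    """Distances between successive occurrences of `symbol`."""
    pos = [i for i, s in enumerate(symbols) if s == symbol]
    return [b - a for a, b in zip(pos, pos[1:])]

def frequency(dists, m):
    """Count how often each distance 1..m occurs."""
    vector = [0] * m
    for d in dists:
        vector[d - 1] += 1
    return vector

def pair_features(seq_a, seq_b, groups):
    """Distance frequency vectors of both subchains, padded to a common length."""
    sym_a, sym_b = translate(seq_a, groups), translate(seq_b, groups)
    d_a = {g: distances(sym_a, g) for g in groups}
    d_b = {g: distances(sym_b, g) for g in groups}
    # Assumed padding rule: common dimension = largest distance in either subchain.
    all_d = [d for ds in list(d_a.values()) + list(d_b.values()) for d in ds]
    m = max(all_d, default=0)
    x_a = {g: frequency(d_a[g], m) for g in groups}
    x_b = {g: frequency(d_b[g], m) for g in groups}
    return x_a, x_b

x_a, x_b = pair_features("MKEKEFQSKP", "QNSLALHKVIMVGSG", HYDROPHOBICITY_GROUPS)
```

Swapping in the van der Waals volume, polarity or polarizability groups from Table 6 would give the Vf, Pf and Zf feature sets in the same way.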

When DF is used to represent a protein subchain pair, the feature vector becomes sparse 
and its dimension becomes large as the subchain sequence gets longer. To further extract the 
features, Principal Component Analysis (PCA) was then used to reduce the dimension, and amino acid 
distance frequency combined with PCA-based dimension reduction is termed DFPCA. 
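
One common way to fix the reduced dimension is to keep the smallest number of principal components whose eigenvalues account for a target share of the total variance. The sketch below shows that selection step only, for the 99.9% level mentioned in Section 2.3. Reading "information" as explained variance is an assumption, the function name is hypothetical, and the eigenvalues are made-up numbers rather than values from the paper.

```python
def components_to_retain(eigenvalues, retained=0.999):
    """Smallest k such that the top-k eigenvalues reach the retained variance share."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= retained:
            return k
    return len(ordered)

# Hypothetical covariance eigenvalues of a DF feature matrix.
k = components_to_retain([6.0, 3.0, 0.9995, 0.0004, 0.0001])
print(k)
```

Because DF vectors are sparse, most of the variance tends to sit in a small number of directions, which is why a high retention threshold can still shrink the dimension considerably.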

3.3. Amino Acid Index Distribution (AAID) 

Let $I_1, I_2, \ldots, I_i, \ldots, I_{20}$ be the physicochemical values of the 20 natural amino acids $a_i$ (A, 
C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y), respectively, which can be accessed through 
the DBGET/LinkDB system by inputting an amino acid index (e.g., LEWP710101). An amino acid 
index is a set of 20 numerical values representing any of the different physicochemical and biochemical 
properties of amino acids. We can download these indices from the AAindex database 
(http://www.genome.jp/aaindex/). 

For a given protein sequence P of length $L$, we replace each residue in the primary sequence by 
its amino acid physicochemical value, which results in a numerical sequence $h_1, h_2, \ldots, h_l, \ldots, h_L$ 
($h_l \in \{I_1, I_2, \ldots, I_{20}\}$). 

Then, we can define the following feature of amino acid $a_i$ to represent the protein sequence: 

$w_i = I_i \cdot f_i$ (4)

where $f_i$ is the frequency with which amino acid $a_i$ occurs in protein sequence P, $I_i$ is the physicochemical 
value of amino acid $a_i$, and the symbol $\cdot$ indicates the simple product. $f_i$ and $I_i$ are mutually 
independent. Obviously, $w_i$ includes the physicochemical information and statistical information of 
amino acid $a_i$, but it loses the sequence-order information. Therefore, to let the feature vectors contain 
more sequence-order information, we introduced the 2-order center distance $d_i$ by considering the 
position of amino acid $a_i$, which is defined as 

$d_i = \sum_{j=1}^{N_{a_i}} \left( \frac{k_{i,j} - \bar{k}_i}{L} \cdot I_i \right)^2$ (5)

where $N_{a_i}$ is the total number of amino acids $a_i$ appearing in the protein sequence P, 
$k_{i,j}$ ($j = 1, 2, \ldots, N_{a_i}$) is the $j$th position of amino acid $a_i$ in the sequence, and $\bar{k}_i$ is the mean of the 
positions of amino acid $a_i$. 

Now feature $d_i$ contains the physicochemical information, statistical information and the 
sequence-order information of amino acid $a_i$, but it still does not distinguish the protein pairs in some 
cases. For example, assume two protein pairs $P_a$-$P_b$ and $P_c$-$P_d$. The sequences of proteins $P_a$, $P_b$, $P_c$ and 
$P_d$ are respectively: 

$P_a$: MPPRNKPNRR; $P_b$: MPNPRNNKPPGRKTR 
$P_c$: MPRRNPPNRK; $P_d$: MGTRPPRNNKPNPRK 

Obviously, $P_a$ and $P_c$, as well as $P_b$ and $P_d$, have the same $w_i$ and $d_i$. If we use the orthogonal sum 
vector, we cannot distinguish between the $P_a$-$P_b$ and $P_c$-$P_d$ protein pairs. To solve this problem, the 
3-order center distance $t_i$ of amino acid $a_i$ was introduced, which is defined as 

$t_i = \sum_{j=1}^{N_{a_i}} \left( \frac{k_{i,j} - \bar{k}_i}{L} \cdot I_i \right)^3$ (6)
Finally, we can use a combined feature vector to represent protein sequence P by serializing the above 
three features as 

$x = [w_1, \ldots, w_i, \ldots, w_{20}, d_1, \ldots, d_i, \ldots, d_{20}, t_1, \ldots, t_i, \ldots, t_{20}]^T$ (7)

The protein pair $P_a$-$P_b$ can now be represented by the following feature vectors: 

$x_{ab} = [w^a_1, \ldots, w^a_{20}, d^a_1, \ldots, d^a_{20}, t^a_1, \ldots, t^a_{20}, w^b_1, \ldots, w^b_{20}, d^b_1, \ldots, d^b_{20}, t^b_1, \ldots, t^b_{20}]^T$ (8)

or 

$x_{ba} = [w^b_1, \ldots, w^b_{20}, d^b_1, \ldots, d^b_{20}, t^b_1, \ldots, t^b_{20}, w^a_1, \ldots, w^a_{20}, d^a_1, \ldots, d^a_{20}, t^a_1, \ldots, t^a_{20}]^T$ (9)

Generally, vector $x_{ab}$ is not equal to vector $x_{ba}$. As such, if a query protein pair $P_a$-$P_b$ is represented 
by $x_{ab}$ and $x_{ba}$ respectively, the prediction results may be different. In this paper, we chose the 
pairwise kernel function to solve this dilemma. 
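
The three AAID features can be sketched as follows. Several points are assumptions: the index values in `TOY_INDEX` are hypothetical numbers and not a real AAindex entry, $f_i$ is taken to be the relative frequency (count divided by $L$), and the center distances are computed as sums of powers of the position deviation scaled by $L$ and multiplied by $I_i$. The example checks the claim made for the sequences $P_a$ and $P_c$ above: their $w_i$ and 2-order features coincide, while the 3-order features differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical index values for illustration only (NOT taken from AAindex).
TOY_INDEX = {aa: 0.1 * (i + 1) for i, aa in enumerate(AMINO_ACIDS)}

def aaid_features(sequence, index):
    """Return the 60-dimensional vector [w_1..w_20, d_1..d_20, t_1..t_20]."""
    length = len(sequence)
    w, d, t = [], [], []
    for aa in AMINO_ACIDS:
        positions = [k for k, ch in enumerate(sequence, start=1) if ch == aa]
        n = len(positions)
        # Assumption: f_i is the relative frequency of the amino acid.
        w.append(index[aa] * n / length)
        if n == 0:
            d.append(0.0)
            t.append(0.0)
            continue
        mean = sum(positions) / n
        # Assumed form: deviation from the mean position, scaled by L and I_i.
        scaled = [(k - mean) / length * index[aa] for k in positions]
        d.append(sum(s ** 2 for s in scaled))
        t.append(sum(s ** 3 for s in scaled))
    return w + d + t

xa = aaid_features("MPPRNKPNRR", TOY_INDEX)   # P_a
xc = aaid_features("MPRRNPPNRK", TOY_INDEX)   # P_c
same_w_and_d = all(abs(p - q) < 1e-12 for p, q in zip(xa[:40], xc[:40]))
different_t = any(abs(p - q) > 1e-6 for p, q in zip(xa[40:], xc[40:]))
print(same_w_and_d, different_t)
```

The odd power is what breaks the tie: squared deviations are unchanged when the positions of a residue are mirrored about their mean, whereas cubed deviations change sign.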

3.4. Pairwise Kernel Function 

Ben-Hur and Noble [13] first introduced a tensor product pairwise kernel function $K_I$ to measure the 
similarity between two protein pairs. The comparison between a pair $(x_1, x_2)$ and another pair $(x_3, x_4)$ 
for $K_I$ is done through the comparison of $x_1$ with $x_3$ and $x_2$ with $x_4$, on the one hand, and the comparison 
of $x_1$ with $x_4$ and $x_2$ with $x_3$, on the other hand, as 

$K_I((x_1, x_2), (x_3, x_4)) = K(x_1, x_3) \cdot K(x_2, x_4) + K(x_1, x_4) \cdot K(x_2, x_3)$ (10)

However, the $K_I$ kernel does not consider differences between the elements of comparison pairs in the 
feature space; therefore, Vert [50] proposed the following metric learning pairwise kernel $K_{II}$: 

$K_{II}((x_1, x_2), (x_3, x_4)) = (K(x_1, x_3) + K(x_2, x_4) - K(x_1, x_4) - K(x_2, x_3))^2$ (11)

In particular, two protein pairs might be very similar for the $K_{II}$ kernel, even if the patterns of the first 
protein pair are very different from those of the second protein pair, whereas the $K_I$ kernel could result in 
a large dissimilarity between the two protein pairs. It is easy to prove that the $K_{II}$ kernel satisfies both 
Mercer's condition and the pairwise kernel function condition. In this paper, we use the $K_{II}$ kernel 
function to predict PPIs. 
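
Both pairwise kernels can be written directly on top of a base kernel. The sketch below uses an RBF base kernel with an arbitrary `gamma`; the function names and the toy vectors are assumptions for illustration, and this is not the modified LIBSVM code used in the paper. It also shows the property that motivates the approach: the pairwise kernel value does not change when the two proteins of a pair are swapped, whereas an RBF kernel on concatenated vectors does.

```python
import math

def rbf(x, y, gamma=0.5):
    """Base RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def k_tensor(pair1, pair2, base=rbf):
    """Tensor product pairwise kernel K_I."""
    (x1, x2), (x3, x4) = pair1, pair2
    return base(x1, x3) * base(x2, x4) + base(x1, x4) * base(x2, x3)

def k_metric(pair1, pair2, base=rbf):
    """Metric learning pairwise kernel K_II."""
    (x1, x2), (x3, x4) = pair1, pair2
    return (base(x1, x3) + base(x2, x4) - base(x1, x4) - base(x2, x3)) ** 2

# Toy feature vectors for four proteins.
a, b = [1.0, 0.0], [0.0, 2.0]
c, d = [1.0, 1.0], [0.5, 0.5]

# Pairwise kernels: the order inside a pair does not matter.
print(k_metric((a, b), (c, d)), k_metric((b, a), (c, d)))
print(k_tensor((a, b), (c, d)), k_tensor((b, a), (c, d)))

# Plain RBF on concatenated vectors: the order does matter.
print(rbf(a + b, c + d), rbf(b + a, c + d))
```

In practice, a Gram matrix filled with such values can be passed to an SVM implementation that accepts precomputed kernels, which is one alternative to modifying the kernel code inside the library.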

3.5. Assessment of Prediction System 

Sensitivity ($S_n$), specificity ($S_p$), positive predictive value (PPV) and total prediction accuracy 
(ACC) [39-41], together with the Matthews correlation coefficient (MCC), were employed to measure the performance of PPI-PKSVM. 

$S_n = \frac{TP}{TP + FN}$ (12)

$S_p = \frac{TN}{TN + FP}$ (13)

$PPV = \frac{TP}{TP + FP}$ (14)

$ACC = \frac{TP + TN}{TP + TN + FP + FN}$ (15)

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}}$ (16)

where TP and TN are the numbers of correctly predicted subchain pairs of interacting proteins and 
noninteracting proteins, respectively, and FP and FN are the numbers of incorrectly predicted subchain 
pairs of noninteracting proteins and interacting proteins, respectively. 
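
The five measures can be computed from a confusion matrix as in the sketch below. The function name is hypothetical, and the counts in the usage example are illustrative values for a balanced set of 2510 + 2510 pairs; they are not counts reported in the paper.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Sensitivity, specificity, PPV, accuracy and MCC from confusion counts."""
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "MCC": (tp * tn - fp * fn) / denominator,
    }

# Hypothetical counts: 2510 interacting and 2510 noninteracting pairs.
scores = evaluate(tp=2408, tn=2300, fp=210, fn=102)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Unlike ACC, MCC stays informative when the positive and negative sets are unbalanced, which is relevant for datasets such as the one used in Section 2.5.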

4. Conclusions 

In this work, we introduced two feature extraction approaches to represent the protein sequence. 
One is amino acid distance frequency with PCA reducing the dimension, termed DFPCA. Another is 
amino acid index distribution based on the physicochemical values of amino acids, termed AAID. 
The pairwise kernel function SVM was employed as the classifier to predict the PPIs. From the results, 
we can conclude that (i) the performance of DFPCA is better than that of DF; (ii) the prediction power of 
PRBF is superior to RBF, suggesting that designing a rational pairwise kernel function is important for 
predicting PPIs; (iii) DFPCA and AAID with pairwise kernel function SVM are effective and promising 
approaches for predicting PPIs and may complement existing methods. Since user-friendly and publicly 
accessible web servers represent the future direction in the development of predictors, we have provided 
a web server for PPI-PKSVM, and it can be found at (http://159.226.118.31/PPI/index.html). 
PPI-PKSVM in its present version can be used to evaluate one protein pair. However, we will soon be 
developing a newer online version able to predict large numbers of PPIs. 